library("broom")
library("dplyr")
library("ggplot2")
library("ggrepel")
library("modelsummary")
library("ggdag")
library("purrr")
library("stringr")
library("ggpubr")
library("ggpattern")
# For Maps
library("sf")
library("stars")
library("sp")
library("rnaturalearth")
library("rnaturalearthdata")
library("ggspatial")
library("dichromat")
# For interactive maps
library("leaflet")
library("leaflegend")
library("htmltools")
# Removing previous files and setting the
# directory
rm(list = ls())
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/lab10/")Lab 10: Statistics
Analysis of RCT (Randomized Control Trial) Data
1 Introduction
In this hypothetical situation (the data is made up), the World Bank is planning on launching a program designed to reduce the risk of malaria in Malawi by providing insecticide-treated bed nets. So they have funding to run a randomized controlled trial (RCT) in three districts in Malawi that are very close to the Malawi lake. The World Bank randomly selected 1000 individuals from these districts and they want to investigate whether receiving such nets has any effect on people’s risk of malaria.
In this document, we want to calculate \(E(\text{Post-Malaria Risk | Net})\)
In a first instance, we want to draw a causal model which explains the effect of the insecticide-treated bed nets on participant risk of malaria, given the confounding caused by age, sex, and income. We think that the risk of malaria will be impacted by age, sex, income, prior risk of malaria, and whether people used the insecticide-treated net. See the models below (together with the code to produce them).
Show the code
malaria_dag <- dagify(post_malaria_risk ~ net + age + sex + income + pre_malaria_risk,
net ~ income + pre_malaria_risk,
exposure = "net",
outcome = "post_malaria_risk",
labels = c(post_malaria_risk = "Post Malaria Risk",
net = "Mosquito Net",
age = "Age",
sex = "Sex",
income = "Income",
pre_malaria_risk = "Pre Malaria Risk"),
coords = list(x = c(net = 2, post_malaria_risk=7, income = 5, age = 2,
sex = 4, pre_malaria_risk=6),
y = c(net = 3, post_malaria_risk=2, income = 1, age = 2,
sex = 4, pre_malaria_risk=4)))
bigger_dag <-data.frame(tidy_dagitty(malaria_dag))
bigger_dag$type<-NA
bigger_dag$type<-"Confounder"
bigger_dag$type[bigger_dag$name=="post_malaria_risk"]<-"Outcome"
bigger_dag$type[bigger_dag$name=="net"]<-"Intervention"
min_lon_x<-min(bigger_dag$x)
max_lon_x<-max(bigger_dag$x)
min_lat_y<-min(bigger_dag$y)
max_lat_y<-max(bigger_dag$y)Show the code
col = c("Outcome"="green3",
"Intervention"="red",
"Confounder"="grey60")
order_col<-c("Outcome", "Intervention", "Confounder")
example_dag<-ggplot(data = bigger_dag, aes(x = x, y = y, xend = xend, yend = yend, color=type)) +
geom_dag_point() +
geom_dag_edges() +
coord_sf(xlim = c(min_lon_x-1, max_lon_x+1), ylim = c(min_lat_y-1, max_lat_y+1), expand = FALSE)+
scale_colour_manual(values = col,
name = "Group",
breaks = order_col)+
geom_label_repel(data = subset(bigger_dag, !duplicated(bigger_dag$label)),
aes(label = label), fill = alpha(c("white"),0.8))+
theme_bw()+
labs(x = "", y="")
ggsave(example_dag, file = "./graphs/example_dag1.jpg",
height = 20, width = 25,
units = "cm", dpi = 200)
example_dagIn the experiment, we want to manipulate access to the program (providing nets). Thus, we can calculate \(E(\text{malaria risk | net})\). Following the rules of causal diagrams, we can delete all the arrows going into the program node, because the nets are randomly assigned.
Show the code
malaria_dag2 <- dagify(post_malaria_risk ~ net + age + sex + income + pre_malaria_risk,
exposure = "net",
outcome = "post_malaria_risk",
labels = c(post_malaria_risk = "Post Malaria Risk",
net = "Mosquito Net",
age = "Age",
sex = "Sex",
income = "Income",
pre_malaria_risk = "Pre Malaria Risk"),
coords = list(x = c(net = 2, post_malaria_risk=7, income = 5, age = 2,
sex = 4, pre_malaria_risk=6),
y = c(net = 3, post_malaria_risk=2, income = 1, age = 2,
sex = 4, pre_malaria_risk=4)))Show the code
bigger_dag <-data.frame(tidy_dagitty(malaria_dag2))
bigger_dag$type<-NA
bigger_dag$type<-"Confounder"
bigger_dag$type[bigger_dag$name=="post_malaria_risk"]<-"Outcome"
bigger_dag$type[bigger_dag$name=="net"]<-"Intervention"
min_lon_x<-min(bigger_dag$x)
max_lon_x<-max(bigger_dag$x)
min_lat_y<-min(bigger_dag$y)
max_lat_y<-max(bigger_dag$y)Show the code
col = c("Outcome"="green3",
"Intervention"="red",
"Confounder"="grey60")
order_col<-c("Outcome", "Intervention", "Confounder")
example_dag2<-ggplot(data = bigger_dag, aes(x = x, y = y, xend = xend, yend = yend, color=type)) +
geom_dag_point() +
geom_dag_edges() +
# geom_text(data = bigger_dag, aes(x = x, y = y, label = label),
# hjust=0.4, vjust=0, color = "black")+
coord_sf(xlim = c(min_lon_x-1, max_lon_x+1), ylim = c(min_lat_y-1, max_lat_y+1), expand = FALSE)+
scale_colour_manual(values = col,
name = "Group",
breaks = order_col)+
geom_label_repel(data = subset(bigger_dag, !duplicated(bigger_dag$label)),
aes(label = label), fill = alpha(c("white"),0.8))+
theme_bw()+
labs(x = "", y="")
ggsave(example_dag2, file = "./graphs/example_dag2.jpg",
height = 20, width = 25,
units = "cm", dpi = 200)
example_dag22 Loading the Data
Create a folder called lab10. Within it, create another folder and call it data. Go to the following links and download the following:
Place all the data in the data foler.
3 Inspecting the Area
Let’s assume that the experiment has been conducted and the data has come in. Let us first have a first look at the area where the World Bank ran this RCT.
3.1 Loading the Shape Files
st_read reads the shapefiles from your hardrive. read_stars reads a tif file that shows the elevation for your area of interest. ne_countries loads a shapefile with all the countries around the world.
#Country shapefile
mwi <- st_read("./data/malawi-latest-free/gadm41_MWI_0.shp")
#Region 1 shapefile
mwi1 <- st_read("./data/malawi-latest-free/gadm41_MWI_1.shp")
#Region 3 shapefile
mwi3 <- st_read("./data/malawi-latest-free/gadm41_MWI_3.shp")
#Selected communes shapefile
select_comm <- st_read("./data/malawi-latest-free/selected_communes3.shp")
#Raster with elevation
elev <- read_stars("./data/elev.tif")
#World Shapefile
world <- ne_countries(scale = "medium", returnclass = "sf")
#Switched off spherical geometry (s2)
sf::sf_use_s2(FALSE)3.2 Simplifying the shapefiles for better rendering
st_simplify simplifies the polygons for quicker rendering.
#Simplifying polygons for quicker rendering
#Note: you have to play around with dTolerance to figure out the appropriate value
#Too little - your computer will need more time to render the file
#Too much - the country polygons will be too simple
mwi<-st_simplify(mwi, dTolerance = 0.005)
mwi1<-st_simplify(mwi1, dTolerance = 0.005)
mwi3<-st_simplify(mwi3, dTolerance = 0.005)
select_comm<-st_simplify(select_comm, dTolerance = 0.0005)
world<-st_simplify(world, dTolerance = 0.005)3.3 Calculating the appropriate zoom level for Malawi
We now calculate the zoom level for the entire country of Malawi.
#Calculating the appropriate zoom level
mwi_extent <- st_bbox(mwi)
min_lon_x<-mwi_extent["xmin"]
max_lon_x<-mwi_extent["xmax"]
min_lat_y<-mwi_extent["ymin"]
max_lat_y<-mwi_extent["ymax"]We now create a new polygon to emphasize where Malawi is on the World Map.
#Creating a dataframe to emphasize Malawi and adding some error
error = 5
sp_df = data.frame("lon" = c(min_lon_x - error, max_lon_x + error,
min_lon_x - error, max_lon_x + error),
"lat" = c(min_lat_y - error, min_lat_y - error,
max_lat_y + error, max_lat_y + error))We now make a spatial polygon out of it so that we can highlight where Malawi is on the world map.
#Creating polygon to indicate zoom
poly <- sp_df %>%
st_as_sf(coords = c("lon", "lat"),
crs = st_crs(mwi)) %>%
st_bbox() %>%
st_as_sfc()
#Examining what we just created
ggplot() +
geom_sf(data = mwi, fill="red", color = "red", linewidth = 0.9)+
geom_sf(data = poly, fill=NA, color = "red", linewidth = 0.3)3.4 Cropping the raster
We now crop our raster only to the area of interest. Again, if we have too much, our computer will have a hard time rendering the file.
#Step1: Reducing raster resolution: Higher dx/day values mean lower resolution
elev2<-st_warp(elev, st_as_stars(st_bbox(elev), dx = 0.05, dy = 0.05))
#Step2: Removing really high pixel values for good contrast in MWI
elev2[elev2 > 2000] <- NA
#Step3: Cropping
dem_mwi<-st_crop(elev2, poly)map_word<-ggplot() +
geom_sf(data = world, fill=NA, color = "black", linewidth = 0.3)+
geom_sf(data = mwi, fill="red", color = "red", linewidth = 0.7)+
geom_sf(data = poly, fill=NA, color = "red", linewidth = 0.9)+
theme_bw()+
labs(x = "Longitude", y="Latitude")+
ggtitle("Malawi on the World Map")+
theme(axis.text.x = element_text(size=14),
axis.text.y = element_text(size=14),
axis.title=element_text(size=14),
plot.title = element_text(hjust = 0.5),
legend.position = c(1, 0),
#Legend.position values should be between 0 and 1. c(0,0) corresponds to the "bottom left"
#and c(1,1) corresponds to the "top right" position.
legend.box.background = element_rect(fill='white'),
legend.background = element_blank(),
legend.text=element_text(size=12))
#ggspatial::annotation_scale(location = 'tr')
map_wordggsave(map_word, file="./graphs/map_world.jpg", height=15, width=30, units = "cm", dpi=300)3.5 Mapping the region of interest - Malawi
#Selecting only the region of interest
mwi1_lab<-subset(mwi1, NAME_1=="Nkhata Bay")
#Defining colors for raster
grey_palette <- colorRampPalette(c("grey90", "grey30"))
# Generate a sequence of colors based on the number of breaks desired
num_breaks <- 10 # Adjust this based on your data
grey_colors <- grey_palette(num_breaks)This is the region of Nkhata Bay where the World Bank ran the RCT. Within the region of Nkhata Bay, the World Bank chose three settlements close to the lake where people could be exposed to malaria.
We now want to draw a new map to emphasize where Nkhata Bay is.
error<-0.2
map_malawi<-ggplot() +
geom_stars(data=elev2, show.legend=FALSE) +
scale_fill_gradientn(colours = grey_colors)+
geom_sf(data = select_comm, fill="green", color = "green", linewidth = 0.4)+
geom_sf(data = mwi1, fill=NA, color = "blue", linewidth = 0.2)+
geom_sf(data = mwi1_lab, fill=NA, color = "blue", linewidth = 0.9)+
geom_sf_text(data=mwi1_lab, aes(label = NAME_1), size=3.5)+
theme_bw()+
labs(x = "Longitude", y="Latitude")+
ggtitle("Nkhata Bay in Malawi")+
theme(axis.text.x = element_text(size=14),
axis.text.y = element_text(size=14),
axis.title=element_text(size=14),
plot.title = element_text(hjust = 0.5),
legend.position = c(1, 0),
#Legend.position values should be between 0 and 1. c(0,0) corresponds to the "bottom left"
#and c(1,1) corresponds to the "top right" position.
legend.box.background = element_rect(fill='white'),
legend.background = element_blank(),
legend.text=element_text(size=12))+
coord_sf(xlim = c(min_lon_x-5*error, max_lon_x+5*error), ylim = c(min_lat_y-error, max_lat_y+error))+
ggspatial::annotation_scale(location = 'tr')
#Saving the map
ggsave(map_malawi, file="./graphs/map_malawi.jpg", height=15, width=15, units = "cm", dpi=300)
map_malawi3.6 Calculating the appropriate zoom level for the Selected Communes
We now calculate the zoom level for the region of interest - Nkhata Bay.
#Calculating the appropriate zoom level for the selected commune
mwi_extent <- st_bbox(select_comm)
min_lon_x<-mwi_extent["xmin"]
max_lon_x<-mwi_extent["xmax"]
min_lat_y<-mwi_extent["ymin"]
max_lat_y<-mwi_extent["ymax"]#Re-reading the shapefiles.
#We now want more accurate polygons since we are zooming in.
#This is why we load them again
mwi <- st_read("./data/malawi-latest-free/gadm41_MWI_0.shp")
mwi1 <- st_read("./data/malawi-latest-free/gadm41_MWI_1.shp")
mwi3 <- st_read("./data/malawi-latest-free/gadm41_MWI_3.shp")
select_comm <- st_read("./data/malawi-latest-free/selected_communes3.shp")
dem <- read_stars("./data/elev.tif")
#Step1: Reducing raster resolution: Higher dx/day values mean lower resolution
elev2<-st_warp(dem, st_as_stars(st_bbox(elev), dx = 0.02, dy = 0.02))
#Step2: Removing really high pixel values for good contrast in MWI
elev2[elev2 > 2000] <- NA
#Step3: Creating a dataframe to emphasize Malawi and adding some error
error = 0.5
sp_df = data.frame("lon" = c(min_lon_x - error, max_lon_x + error,
min_lon_x - error, max_lon_x + error),
"lat" = c(min_lat_y - error, min_lat_y - error,
max_lat_y + error, max_lat_y + error))
#Step4: Creating polygon to indicate zoom
poly <- sp_df %>%
st_as_sf(coords = c("lon", "lat"),
crs = st_crs(mwi)) %>%
st_bbox() %>%
st_as_sfc()
#Step5: Cropping
dem_mwi<-st_crop(elev2, poly)
world <- ne_countries(scale = "medium", returnclass = "sf")
#Turning off spherical geometries
sf::sf_use_s2(FALSE)#We also want to include roads in our map, to provide more context.
#Reading the OSM roads data
roads<-st_read("./data/malawi-latest-free/gis_osm_roads_free_1_crop2.shp")
#Step1: Dissolving polylines
roads2 <- st_combine(roads)
#Step2: Turning roads2 into an sf object
roads3<-st_as_sf(roads2)
#Step3: Only the roads of interest
roads_crop1 <- sf::st_intersection(roads3, select_comm)3.7 Mapping the Individuals
Let us first look at our data.
rct_data <- read.csv(file = './data/rct_data_geo_malawi.csv')
glimpse(rct_data, n=10)Rows: 1,000
Columns: 10
$ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ lat <dbl> -11.27476, -11.15300, -11.13967, -11.25544, -11.1777…
$ lon <dbl> 34.03006, 34.14594, 34.14297, 34.14619, 34.20031, 34…
$ income <dbl> 409.4701, 520.8072, 581.3331, 324.0727, 532.1844, 53…
$ age <int> 10, 35, 7, 43, 45, 51, 24, 17, 38, 69, 10, 21, 32, 3…
$ sex <chr> "Man", "Woman", "Woman", "Woman", "Man", "Woman", "M…
$ health <dbl> 82.78928, 81.18602, 80.57399, 82.84023, 87.40737, 57…
$ net <int> 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1…
$ malaria_risk_pre <dbl> 35.74454, 36.65260, 22.87415, 35.41934, 27.49873, 42…
$ malaria_risk_post <dbl> 42.52199, 34.27589, 32.67552, 43.87256, 27.16945, 41…
error<-0.05
map_malawi_exp<-ggplot() +
#Plotting the elevation raster
geom_stars(data=elev2, show.legend=FALSE) +
#Making elevation grey
scale_fill_gradientn(colours = grey_colors)+
#Plotting the selected commune
geom_sf(data = select_comm, fill="green", color = "green", linewidth = 0.4)+
geom_sf(data = mwi1, fill=NA, color = "blue", linewidth = 0.2)+
geom_sf(data = mwi3, fill=NA, color = "red", linewidth = 0.4)+
geom_sf(data = mwi1_lab, fill=NA, color = "blue", linewidth = 0.9)+
#Plotting the roads
geom_sf(data = roads_crop1, fill=NA, color = "black", linewidth = 0.3)+
#Plotting the RCT data
geom_point(data = rct_data, aes(x = lon, y= lat), color = "red", size=0.3)+
theme_bw()+
labs(x = "Longitude", y="Latitude")+
ggtitle("The Three Districts Selected in Nkhata Bay, Malawi\n Location of 1000 Individuals\n For the Experiment")+
theme(axis.text.x = element_text(size=14),
axis.text.y = element_text(size=14),
axis.title=element_text(size=14),
plot.title = element_text(hjust = 0.5),
legend.position = c(1, 0),
#Legend.position values should be between 0 and 1. c(0,0) corresponds to the "bottom left"
#and c(1,1) corresponds to the "top right" position.
legend.box.background = element_rect(fill='white'),
legend.background = element_blank(),
legend.text=element_text(size=12))+
coord_sf(xlim = c(min_lon_x-3*error, max_lon_x+3*error),
ylim = c(min_lat_y-error, max_lat_y+error))+
#Adding a scale to the map
ggspatial::annotation_scale(location = 'tr')
#Saving the picture in folder one your machine called "graphs"
ggsave(map_malawi_exp, file="./graphs/map_malawi_exp.jpg", height=15, width=15, units = "cm", dpi=300)
map_malawi_exp3.8 Mapping the Individuals in an Interactive Way
Here is how we can make an interactive map.
rct_data$treat<-NA
rct_data$treat[rct_data$net==1]<-"Treatment"
rct_data$treat[rct_data$net==0]<-"Control"
#name <- colorFactor(topo.colors(3), domain = carrefour$name)
pal <-
colorFactor(palette = c("#8ED5D8", "#F2B6B2"),
levels = c("Control", "Treatment"))
leaflet(rct_data) %>% addTiles() %>%
addPolygons(data = select_comm,
group = "Nkhata Bay",
# stroke parameters
weight = 3, color = "blue",
# fill parameters
fillColor = "blue", fillOpacity = 0.0)%>%
addCircleMarkers(lng = ~lon, lat = ~lat,
label = ~htmlEscape(paste("Indiv.: ", id)), radius = 4, color = ~pal(treat))%>%
addLegendFactor(
pal=pal,
values = rct_data$treat,
position = 'topright',
shape = 'circle',
width = 10,
height = 10,
opacity = 0.5)%>%
addScaleBar(
position = c("bottomright"))4 Data Analysis
We will now do the data analysis. Until now, we looked at various ways of mapping our data.
4.1 Summary statisics
Before we conduct any analysis, it is important to get a sense of our data, the minimum and maximum, the mean, and whether our variables are normally distributed or not. Let us now create a summary statistics table:
datasummary_skim(rct_data)| Unique (#) | Missing (%) | Mean | SD | Min | Median | Max | ||
|---|---|---|---|---|---|---|---|---|
| id | 1000 | 0 | 500.5 | 288.8 | 1.0 | 500.5 | 1000.0 | |
| lat | 1000 | 0 | −11.2 | 0.1 | −11.3 | −11.2 | −11.1 | |
| lon | 1000 | 0 | 34.1 | 0.1 | 34.0 | 34.1 | 34.2 | |
| income | 1000 | 0 | 498.0 | 74.8 | 245.3 | 497.0 | 739.7 | |
| age | 76 | 0 | 28.9 | 16.1 | 1.0 | 27.0 | 89.0 | |
| health | 1000 | 0 | 69.1 | 12.9 | 30.9 | 69.0 | 100.0 | |
| net | 2 | 0 | 0.5 | 0.5 | 0.0 | 1.0 | 1.0 | |
| malaria_risk_pre | 1000 | 0 | 38.0 | 9.8 | 5.0 | 37.9 | 70.0 | |
| malaria_risk_post | 1000 | 0 | 40.4 | 10.4 | 5.0 | 40.5 | 70.0 |
Income and malaria seem to be normally distributed. The net variable is binary, as expected. Age and health seem to be skewed to the left.
4.2 Checking balance
Before calculating the effect of the program, we should first check how well balanced assignment was. As you can tell from the results below, assignment to the program was pretty much split 50/50.
people_randomized<-rct_data %>%
count(net) %>%
mutate(prop = n / sum(n))
people_randomizedWe also want to check to see how well balanced the treatment and control groups were in participants’ pre-treatment characteristics:
rct_data$sex_num<-as.numeric(as.factor(as.character(rct_data$sex)))
rct_data$sex_num[rct_data$sex_num==2]<-0
people_randomized<-rct_data %>%
group_by(net) %>%
summarize(prop_male = mean(sex_num),
avg_age = mean(age),
avg_malaria_risk_pre = mean(malaria_risk_pre),
avg_health = mean(health),
avg_income = mean(income))
people_randomizedThese variables appear fairly well balanced. To check that there aren’t any statistically significant differences between the groups, we should also conduct some t-tests.
t.test(sex_num ~ net, data = rct_data, alternative=c("two.sided"))
Welch Two Sample t-test
data: sex_num by net
t = -0.16658, df = 990.67, p-value = 0.8677
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-0.06597315 0.05564916
sample estimates:
mean in group 0 mean in group 1
0.3933054 0.3984674
t.test(age ~ net, data = rct_data, alternative=c("two.sided"))
Welch Two Sample t-test
data: age by net
t = -1.275, df = 997.37, p-value = 0.2026
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-3.2870789 0.6979223
sample estimates:
mean in group 0 mean in group 1
28.25523 29.54981
t.test(malaria_risk_pre ~ net, data = rct_data, alternative=c("two.sided"))
Welch Two Sample t-test
data: malaria_risk_pre by net
t = 0.015818, df = 997.7, p-value = 0.9874
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-1.208182 1.227819
sample estimates:
mean in group 0 mean in group 1
37.96975 37.95993
t.test(health ~ net, data = rct_data, alternative=c("two.sided"))
Welch Two Sample t-test
data: health by net
t = -0.40132, df = 996.97, p-value = 0.6883
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-1.931451 1.275577
sample estimates:
mean in group 0 mean in group 1
68.90312 69.23106
t.test(income ~ net, data = rct_data, alternative=c("two.sided"))
Welch Two Sample t-test
data: income by net
t = 0.19875, df = 991.26, p-value = 0.8425
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-8.353269 10.236022
sample estimates:
mean in group 0 mean in group 1
498.4966 497.5552
We can also simply ask to be shown the p-values.
t.test(sex_num ~ net, data = rct_data, alternative=c("two.sided"))[3]$p.value
[1] 0.8677374
t.test(age ~ net, data = rct_data, alternative=c("two.sided"))[3]$p.value
[1] 0.2026112
t.test(malaria_risk_pre ~ net, data = rct_data, alternative=c("two.sided"))[3]$p.value
[1] 0.9873827
t.test(health ~ net, data = rct_data, alternative=c("two.sided"))[3]$p.value
[1] 0.6882691
t.test(income ~ net, data = rct_data, alternative=c("two.sided"))[3]$p.value
[1] 0.8424983
There seems to be a slight difference when it comes to malaria risk and health, where the people who are treated, have slightly greater risk of malaria. At the same time, they also seem to have greater health. The differences however do not seem to be highly statistically significant and the differences are not that great. But let us also plot these differences.
4.3 Plotting Differences
The following graphs plot the point ranges for the variables of interest.
#The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean
plot_diff_sex <- ggplot(rct_data, aes(x = treat, y = sex_num, color = treat)) +
stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
guides(color = "none") +
labs(x = NULL, y = "Prop. male")
plot_diff_age <- ggplot(rct_data, aes(x = treat, y = age, color = treat)) +
stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
guides(color = "none") +
labs(x = NULL, y = "Age")
plot_diff_malaria_risk_pre <- ggplot(rct_data, aes(x = treat, y = malaria_risk_pre, color = treat)) +
stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
guides(color = "none") +
labs(x = NULL, y = "Mal. Risk Bef.")
plot_diff_health <- ggplot(rct_data, aes(x = treat, y = health, color = treat)) +
stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
guides(color = "none") +
labs(x = NULL, y = "Health")
plot_diff_income <- ggplot(rct_data, aes(x = treat, y = income, color = treat)) +
stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
guides(color = "none") +
labs(x = NULL, y = "Income")
ggarrange(plot_diff_sex, plot_diff_age,
plot_diff_malaria_risk_pre,
plot_diff_health, plot_diff_income, ncol=2, nrow=3, common.legend = TRUE)Let us now also include the distributions of these variables by group (treatment and control group).
plot_diff_sex_d<-ggplot(rct_data, aes(pattern = sex, x = treat)) +
geom_bar(data = rct_data,
position = "dodge",
aes(fill=treat),
alpha = 0.5) +
geom_bar_pattern(position = "dodge",
aes(pattern=sex),
fill = NA,
linewidth=0.5,
color="black")+
scale_pattern_manual(name = NULL,
values = c(
Man = "stripe",
Woman = "circle"))+
labs(fill = NULL)+
xlab(NULL)
plot_diff_age_d<-ggplot(rct_data, aes(x = age, fill = treat, group=treat)) +
guides(color = "none") +
geom_histogram(binwidth = 2,
color = "white",
alpha = 0.5)+
#Adding density and multiplying by a manually defined value
#for visibility
geom_density(aes(y = after_stat(density)* 2000, color=treat),
#alpha = 0.5,
fill=NA)+
#Adding a secondary axis
scale_y_continuous(name = "count",
sec.axis =
sec_axis(~.x/2000,
name = "density"))+
labs(fill = NULL)+
xlab(NULL)
plot_diff_malaria_risk_pre_d<-ggplot(rct_data, aes(x = malaria_risk_pre, fill = treat, group=treat)) +
guides(color = "none") +
geom_histogram(binwidth = 2,
color = "white",
alpha = 0.5)+
#Adding density and multiplying by a manually defined value
#for visibility
geom_density(aes(y = after_stat(density)* 2000, color=treat),
#alpha = 0.5,
fill=NA)+
#Adding a secondary axis
scale_y_continuous(name = "count",
sec.axis =
sec_axis(~.x/2000,
name = "density"))+
labs(fill = NULL)+
xlab(NULL)
plot_diff_health_d<-ggplot(rct_data, aes(x = health, fill = treat, group=treat)) +
guides(color = "none") +
geom_histogram(binwidth = 2,
color = "white",
alpha = 0.5)+
#Adding density and multiplying by a manually defined value
#for visibility
geom_density(aes(y = after_stat(density)* 2000, color=treat),
#alpha = 0.5,
fill=NA)+
#Adding a secondary axis
scale_y_continuous(name = "count",
sec.axis =
sec_axis(~.x/2000,
name = "density"))+
labs(fill = NULL)+
xlab(NULL)
plot_diff_income_d<-ggplot(rct_data, aes(x = income, fill = treat, group=treat)) +
guides(color = "none") +
geom_histogram(binwidth = 10,
color = "white",
alpha = 0.5)+
#Adding density and multiplying by a manually defined value
#for visibility
geom_density(aes(y = after_stat(density)* 10000, color=treat),
#alpha = 0.5,
fill=NA)+
#Adding a secondary axis
scale_y_continuous(name = "count",
sec.axis =
sec_axis(~.x/10000,
name = "density"))+
labs(fill = NULL)+
xlab(NULL)
figure1<-ggarrange(plot_diff_sex, plot_diff_sex_d, ncol=2, common.legend = TRUE)
annotate_figure(figure1, top = text_grob("Sex"))figure2<-ggarrange(plot_diff_age, plot_diff_age_d, ncol=2, common.legend = TRUE)
annotate_figure(figure2, top = text_grob("Age"))figure3<-ggarrange(plot_diff_malaria_risk_pre, plot_diff_malaria_risk_pre_d, ncol=2, common.legend = TRUE)
annotate_figure(figure3, top = text_grob("Malaria Risk Pre-Intervention"))figure4<-ggarrange(plot_diff_health, plot_diff_health_d, ncol=2, common.legend = TRUE)
annotate_figure(figure4, top = text_grob("Health Score"))figure5<-ggarrange(plot_diff_income, plot_diff_income_d, ncol=2, common.legend = TRUE)
annotate_figure(figure5, top = text_grob("Income"))4.4 Estimating the Difference
So, we are interested in the causal effect of the program - \(E(\text{Post-Malaria Risk | Net})\). We can find this causal effect by calculating the average treatment effect or ATE:
\[ ATE = E(\overline{\text{Post-Malaria Risk}} | \text{Net = 1}) - E(\overline{\text{Post-Malaria Risk}} | \text{Net = 0}) \]
This is simply the average outcome for people in the program minus the average outcome for people not in the program. We can calculate the group averages:
people_randomized<-rct_data %>%
group_by(net) %>%
summarize(avg_malaria_risk_post = mean(malaria_risk_post))
people_randomizedThat’s 45.66210 -35.57936, or 10.08274, which means that the program caused an decrease in the risk of malaria of 10.08274, on average. We can also simply run a regression with post-program malaria risk as the outcome variable and the program indicator variable as the explanatory variable. The results below show two models: (1) in which we simply control for the treatment status and one in which we add a series of controls. We can see that we get the same effect of 10.08274, which we calculated manually. This is very close in magnitude to 10.235, which is the effect, once we control for income, age, and sex.
model_rct1 <- lm(malaria_risk_post ~ net, data = rct_data)
summary(model_rct1)
Call:
lm(formula = malaria_risk_post ~ net, data = rct_data)
Residuals:
Min 1Q Median 3Q Max
-30.5794 -6.4719 0.0798 6.1909 28.9747
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.6621 0.4146 110.14 <2e-16 ***
net -10.0827 0.5738 -17.57 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.064 on 998 degrees of freedom
Multiple R-squared: 0.2363, Adjusted R-squared: 0.2355
F-statistic: 308.7 on 1 and 998 DF, p-value: < 2.2e-16
model_rct1 <- lm(malaria_risk_post ~ net, data = rct_data)
model_rct2 <- lm(malaria_risk_post ~ net + income + age + sex_num, data = rct_data)
models<-list("Risk of Malaria" = model_rct1,
"Risk of Malaria" = model_rct2)
cm <- c('net'='Received Mosquito Net (Treatment)',
"income" = "income",
"age" = "age",
"sex_num" = "sex",
"(Intercept)"="Intercept")
modelsummary(models, stars = TRUE, coef_map = cm, gof_map = c("nobs", "r.squared"))| Risk of Malaria | Risk of Malaria | |
|---|---|---|
| Received Mosquito Net (Treatment) | −10.083*** | −10.235*** |
| (0.574) | (0.527) | |
| income | −0.045*** | |
| (0.004) | ||
| age | 0.083*** | |
| (0.016) | ||
| sex | 0.518 | |
| (0.539) | ||
| Intercept | 45.662*** | 65.608*** |
| (0.415) | (1.866) | |
| Num.Obs. | 1000 | 1000 |
| R2 | 0.236 | 0.359 |
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 |
Let us now also plot this difference.
ggplot(rct_data, aes(x = malaria_risk_post, fill = treat, group=treat)) +
guides(color = "none") +
geom_histogram(binwidth = 2,
color = "white",
alpha = 0.5)+
#Adding density and multiplying by a manually defined value
#for visibility
geom_density(aes(y = after_stat(density)* 1500, color=treat),
#alpha = 0.5,
fill=NA)+
#Adding a secondary axis
scale_y_continuous(name = "count",
sec.axis =
sec_axis(~.x/1500,
name = "density"))+
labs(fill = NULL)+
xlab(NULL)ggplot(rct_data, aes(x = treat, y = malaria_risk_post, color = treat)) +
stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
guides(color = "none") +
labs(x = NULL, y = "Post Malaria Risk")5 Conclusion
We can now conclude that indeed the provision of insecticide-treated nets reduces the risk of malaria by 10.083 units. Thus, if the we want to reduce such a health risk to the wider region, it might be effective to provide such nets to all vulnerable populations.
Let us recap the steps in the analysis of RCTs
- Step 1: Check balance
- Step 2: Calculate the difference
- Step 3: Show the difference